Abstract. Privacy preserving data mining (PPDM) has been one of the most
interesting, yet challenging, research issues. In PPDM, we seek to outsource
our data for data mining tasks to a third party while maintaining its privacy. In this
paper, we revisit one of the recent PPDM schemes (i.e., FS), which is designed for
privacy preserving association rule mining (PP-ARM). Our analysis shows some
limitations of the FS scheme in terms of the storage it requires to guarantee a
reasonable privacy standard, as well as its high computation cost. In addition,
we introduce a robust definition of privacy that considers the average-case privacy
and motivates the study of a weakness in the structure of FS (i.e., fake transactions
filtering). In order to overcome this limitation, we introduce a hybrid scheme that
considers both privacy and resource guidelines. Experimental results show the
efficiency of our proposed scheme over the previously introduced one and open
directions for further development.

Keywords: privacy preservation, data sharing, association rule mining, resources 
efficiency, average and worst case privacy. 



1 Introduction 

Data mining is a powerful tool for discovering knowledge such as hidden predictive
information, patterns, and correlations from large databases. However, since
the data itself may include information that can lead to user identification, privacy
preserving data mining (PPDM) has become of great interest [2]. In PPDM algorithms,
not only the accuracy of the mining result but also the privacy of the data
itself is considered [3]. Since the first work by Agrawal et al., several PPDM algorithms
have been developed, though the challenge of data privacy has not been totally
solved. These algorithms are basically classified under two directions: cryptographic
and non-cryptographic (i.e., randomization-based) algorithms [4]. While it is
believed that the cryptographic approaches are computationally infeasible for
most of the existing data mining models due to the large data size, the randomization-based
algorithms suffer from low accuracy [5, 6]. Nevertheless, the randomization-based
algorithms have been favored over the cryptographic algorithms, and
therefore several PPDM algorithms based on randomization have been introduced.
These algorithms include data clustering [7, 8, 9, 10], association rule mining
[11, 12, 13, 14, 15, 16, 17], data classification [18, 19, 20, 2], etc.

One of the interesting, though challenging, data mining applications is association
rule mining (ARM) [21, 22]. ARM is a well-researched method for discovering
interesting relations between variables in large databases. When the privacy
concern is added to ARM, privacy preserving association rule mining (PP-ARM) aims to
discover such relations between the variables in the data while maintaining the data
privacy. To do so, several algorithms have been introduced, including the aforementioned
works in [11, 12, 13, 14, 15, 16, 17].

One of these works ([14], referred to through the rest of the paper as FS)
considered adding fake transactions to anonymize the original data transactions in order
to maintain their privacy. This work has several advantages over other existing schemes,
including that any off-the-shelf mining algorithm can be used for mining the modified
data and that it can provide a high theoretical privacy guarantee, though it is
subject to several limitations. In this paper, we revisit the FS scheme and show several
results:

- We present an average-case study of the privacy preservation in FS that better
expresses the real privacy consideration.

- In order to provide a high privacy measure, the FS scheme requires an excessive
amount of storage. Even for the same level of privacy as other existing schemes such
as PS [11], the FS scheme still requires more storage (Section 4).

- In practice, the privacy provided by FS can be breached, given that the original
transactions are not modified and are kept in the released modified data. Similarly, the
fake transactions, since they are larger in number than the real transactions in most
cases, can be filtered, which affects the overall attained privacy (Section 4).

- Also, to take advantage of FS and reduce its memory requirements, we introduce
a hybrid scheme that utilizes both the FS and PS schemes (Section 6).

- We provide thorough theoretical and experimental analyses that demonstrate
the achieved properties of both the revised and hybrid schemes.

The rest of this paper is organized as follows: Section 2 introduces the preliminaries,
definitions, and notations. Section 3 details the procedure of PP-ARM using the
fake transactions (FS) scheme, Section 4 introduces the first part of our contribution by
revisiting the FS scheme, and Section 5 lists some remarks motivating the need for a
hybrid scheme, describes the PS scheme (MASK), and compares it to the FS
scheme. Section 6 introduces our hybrid scheme and its properties relative to other schemes in
terms of privacy, resources, and error (in both analytical and experimental formulations).
Finally, Section 7 draws concluding remarks.

2 Preliminaries and Definitions 

2.1 Why does privacy matter? 

In order to illustrate the importance of privacy when considering data mining, we
provide several examples drawn from the health, marketing, and law areas.

Example 1 (Health care system). A hospital would like to release health care data for
external research purposes. However, insurance companies (the attacker) are interested
in knowing the health records of the patients and their parents (privacy). Given that if
somebody's parents have a specific disease then they (i.e., the children) may have
the same disease with high probability, the insurance companies may increase the
insurance premiums of the children in order to guarantee a high margin of profit.

Example 2 (Marketing and competition). A retail company would like to learn the
patterns of customer choice and future directions from marketing records that it
already has. One possible option for that company is to outsource its own data
to a third party that performs the mining task, discovers any interesting patterns, and
provides them back to the company. While this data is not important to many people,
it would be important to other companies competing in the same market (the
attacker). Therefore, it is required to provide an image of the data that supports the
required task without revealing additional information to the third party.

Example 3 (Regulations and laws). According to several currently applied regulations
and laws, personal data is protected and cannot be stored permanently or used for making
decisions by another party. In particular, as data mining algorithms build decisions on data
patterns, it is hard to remove the bias of decisions based on gender or race. An example
of such regulations is HIPAA (Health Insurance Portability and Accountability
Act; see footnote 1).

2.2 Major Notation 

- FS: the PP-ARM algorithm using fake transactions in [14].

- PS: the PP-ARM algorithm using data masking in [11].

- $P_r^{PS}$: reconstruction probability when using the PS algorithm.

- $P_r^{FS}$: reconstruction probability when using the FS algorithm.

- $P_p^{PS}$: quantification of preserved privacy when using the PS algorithm.

- $P_p^{FS}$: quantification of preserved privacy when using the FS algorithm.

- $w, w_1, w_2$: general parameters used for the ARM with fake transactions to represent
the ratio of fake to real transactions.

- $R_1, R_0$: reconstruction probability of ones and zeros in PS, respectively.

- $a$: privacy parameter in the PS scheme which determines the ratio according to which
ones and zeros are handled.

Note that other notations are defined and used in the context of this paper as well.

2.3 Data Model 

The market basket model is used for ARM (see footnote 2). In the market basket, each user
participates with a tuple (also called a transaction) in the database, where the data tuples are
of fixed length as a sequence of '0' and '1'. The columns in the database represent
the products (i.e., items), where the existence of '1' in the tuple indicates a purchase
of the specified product and the existence of '0' indicates no purchase. Since users
normally buy a small fraction of the products in the market, the number of '1's is
much smaller than the number of '0's. The goal of the mining process is to compute the
set of association rules in the database that satisfy a specific criterion. In general, the
data can be represented as a binary matrix (1) whose entry for item i of user j is equal
to 1 if and only if the item is selected (marked, bought, accessed, etc.) and equal to 0
otherwise.

1 www.hhs.gov/ocr/hipaa/

2 Note that this model is figurative; the applications are not limited to data drawn from the
market model but extend to any other model as well (see the above examples).



2.4 Definitions 

Definition 1 (association rules). Let the whole itemset be $I = \{a_1, a_2, a_3, \ldots, a_n\}$
and let $T = \{t_1, t_2, \ldots, t_N\}$ be a set of $N$ transactions where each transaction $t_i$
is a subset of $I$. The association rule is a statistical implication which can be expressed
as follows: $X \Rightarrow Y$ where $X, Y \subset I$ and $X \cap Y = \emptyset$.

The association rule $X \Rightarrow Y$ is said to have support $s$ if $X \cup Y$ appears in $s\%$ of $T$.
Also, the association rule is said to have confidence $c$ if $c\%$ of the transactions in $T$ that
satisfy $X$ also satisfy $Y$. While the support is a measure of the significance of the association
rule, the confidence is used as a measure of its strength. An association rule is of interest if
both $c$ and $s$ are greater than some threshold. According to the Apriori mining algorithm,
finding the association rules in a dataset is equivalent to finding the frequent itemsets in
that dataset. An itemset is frequent if its support is greater than a threshold.
Formally, the support of an itemset is defined as follows:

Definition 2 (Support of Itemset). [14] Let $A$ be an itemset over the $n$ items
$I = \{a_1, a_2, a_3, \ldots, a_n\}$ and let $T = \{t_1, t_2, \ldots, t_N\}$ be a set of $N$
transactions where each transaction $t_i$ is a subset of $I$. The support of $A$ is defined as follows:

$$supp^T(A) = \frac{\#\{t \in T \mid A \subseteq t\}}{N} \qquad (2)$$

Example 4. Let the items be $I = \{m, c, p, b, j\}$ and the minimum support be $s_{min} = 3$.
Also, let the set of transactions (tuples) $t_1, \ldots, t_8$ be as follows:

t_1 = {m, c, b}
t_2 = {m, p, j}
t_3 = {m, b}
t_4 = {c, j}
t_5 = {m, p, b}
t_6 = {m, c, b, j}
t_7 = {c, b, j}
t_8 = {b, c}



From the transactions, we can systematically derive the representation matrix in 
terms of ones and zeros representing the existence and absence of a specific item in 
each transaction. 



By applying the support model in (2) to the above data matrix, we obtain the following
frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, and {j, c}, with supports
5/8, 5/8, 6/8, 4/8, 4/8, 4/8, and 3/8, respectively.
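
The computation in Example 4 can be checked with a short script. The following is a minimal sketch rather than an implementation of Apriori: it enumerates candidate itemsets by brute force, treats $s_{min} = 3$ as an absolute count, and stops at the first size with no frequent itemset. The function names `support_count` and `frequent_itemsets` are illustrative and are not taken from [14].

```python
from itertools import combinations

# Transactions t_1..t_8 from Example 4.
transactions = [
    {"m", "c", "b"},
    {"m", "p", "j"},
    {"m", "b"},
    {"c", "j"},
    {"m", "p", "b"},
    {"m", "c", "b", "j"},
    {"c", "b", "j"},
    {"b", "c"},
]
items = ["m", "c", "p", "b", "j"]
S_MIN = 3  # minimum support, interpreted here as an absolute count


def support_count(itemset, txns):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in txns if wanted <= t)


def frequent_itemsets(all_items, txns, s_min):
    """Brute-force search; stops at the first size k with no frequent itemset."""
    result = {}
    for k in range(1, len(all_items) + 1):
        found_any = False
        for candidate in combinations(all_items, k):
            count = support_count(candidate, txns)
            if count >= s_min:
                result[frozenset(candidate)] = count
                found_any = True
        if not found_any:
            break  # no superset of an infrequent itemset can be frequent
    return result


# Binary representation matrix: one row per transaction, one column per item.
matrix = [[1 if a in t else 0 for a in items] for t in transactions]

freq = frequent_itemsets(items, transactions, S_MIN)
for itemset, count in freq.items():
    print(sorted(itemset), f"{count}/{len(transactions)}")
```

Dividing each count by the number of transactions (8) gives the fractional supports used with definition (2).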

Definition 3 (Privacy measure). [14] The privacy is defined as the probability according
to which the distorted data can be reconstructed.

Definition 4 (False positive $\sigma^+$). [11] This false positive estimation happens when a
$k$-itemset with a support slightly less than $s_{min}$ is supported by more fake transactions (FT)
than other $k$-itemsets (and is thus included).

Definition 5 (False negative $\sigma^-$). [11] This false negative estimation happens when a
$k$-itemset with a support slightly greater than or equal to $s_{min}$ is supported by fewer fake
transactions (FT) than other $k$-itemsets (and is thus discarded).

3 Association rule mining with fake transactions 

Unlike the previously introduced scheme by Evfimievski et al. [13], which is a per-transaction
noise addition scheme, the ARM using fake transactions scheme [14] (FS for brevity)
adds fake transactions as a means of noise in between the real transactions in the
database. The privacy in FS is determined by the quality and quantity of the fake
transactions added in between the real transactions. The quantity of fake transactions is
determined according to the parameter $w$, which represents the ratio of fake to real
transactions, and the parameter $l$, which determines the average length of the fake transactions.
The parameter $l$ is chosen to be the same as the average length of the real transactions, and
the parameter $w$ is chosen based on the desired quantification of privacy to be
attained ($P_p^{FS}$). $P_p^{FS}$ can be expressed in terms of the hardness of filtering the real
transactions from the fake transactions ($P_r^{FS}$), given as

$$P_r^{FS} = \frac{1}{1 + w} \qquad (3)$$

$P_p^{FS}$ is then given as $P_p^{FS} = 1 - P_r^{FS} = 1 - \frac{1}{1+w}$. Technically, the FS scheme
consists of two parts, which are the data anonymization and the data mining parts. For
the data anonymization, the following procedure is performed:

1. Determine $l_i$ as a realization of a uniformly distributed random variable with mean $l$
that is equal to the average length of the real transactions (i.e., $1 \le l_i \le 2l - 1$).

2. Determine $w^{(i)}$ as the number of fake transactions to be inserted between two real
transactions (specifically, the two real transactions with index $i$ and index $i + 1$ in the
real database). For a predefined $w$ (i.e., mean), $w^{(i)}$ is determined as a realization
of a uniformly distributed random variable with mean $w$ (i.e., $1 \le w^{(i)} \le 2w - 1$).

3. $l_i$ items are selected from $I$ to construct a fake transaction.

4. The process is performed $w^{(i)}$ times for the current insertion.

5. The $w^{(i)}$ fake transactions generated above are inserted between the real
transactions with indexes $i$ and $i + 1$.
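
The anonymization steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the reference implementation of [14]: the function name `anonymize_fs` is hypothetical, $w$ and $l$ are assumed to be positive integers, a fresh length is drawn for every fake transaction, items are sampled uniformly without replacement, and each output tuple carries a real/fake flag only so the result can be inspected (the flag would of course not be released).

```python
import random


def anonymize_fs(real_txns, items, w, l, seed=None):
    """Insert fake transactions between consecutive real transactions.

    real_txns : list of frozensets (the real database, in order)
    items     : list of all items I
    w         : mean number of fake transactions per gap (integer >= 1)
    l         : mean length of a fake transaction (integer >= 1)
    Returns a list of (transaction, is_real) pairs.
    """
    rng = random.Random(seed)
    out = []
    last = len(real_txns) - 1
    for idx, txn in enumerate(real_txns):
        out.append((txn, True))
        if idx == last:
            break  # fakes go only between pairs (i, i+1): N - 1 gaps
        # Step 2: number of fakes for this gap, uniform on [1, 2w - 1].
        w_i = rng.randint(1, 2 * w - 1)
        for _ in range(w_i):  # Step 4: repeat w_i times
            # Step 1: fake length, uniform on [1, 2l - 1] (capped at |I|).
            l_i = rng.randint(1, min(2 * l - 1, len(items)))
            # Step 3: pick l_i distinct items to form the fake transaction.
            fake = frozenset(rng.sample(items, l_i))
            out.append((fake, False))  # Step 5: placed after real txn idx
    return out


real = [frozenset({"m", "c"}), frozenset({"b"}), frozenset({"j", "p"})]
released = anonymize_fs(real, ["m", "c", "p", "b", "j"], w=3, l=2, seed=7)
print(len(released), "transactions released for", len(real), "real ones")
```

Because both uniform ranges are symmetric around their means, the expected number of fake transactions per gap is $w$ and the expected fake length is $l$, which is what the privacy and minimum-support formulas rely on.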

The above steps are repeated for each subsequent pair of tuples (i.e., $N - 1$ times) for the $N$
tuples in the database. For the data mining part (i.e., learning the association rules from
the anonymized data), the following steps are performed:

- The new minimum support of a $k$-itemset in the list of anonymized
transactions $T'$ is computed.

- Using any off-the-shelf algorithm (such as the Apriori algorithm), the association
rules are derived according to the new minimum support.

The new minimum support is derived according to the following steps. Given a fake
transaction $t$ of length $Y$ and a $k$-itemset $A$, the probability that $t$ supports $A$ is:

$$\frac{C_{n-k}^{Y-k}}{C_n^Y} = \frac{C_Y^k}{C_n^k} \quad \text{(when } Y \ge k\text{, and 0 otherwise)} \qquad (4)$$

The number of fake transactions that support a $k$-itemset is approximately given as
follows:

$$\frac{wN}{C_n^k (2l-1)} \sum_{Y=k}^{2l-1} C_Y^k \qquad (5)$$

Assume the support of $A$ in $T'$ is $s'$ (i.e., $supp^{T'}(A) = s'$); then the number of
transactions in $T'$ that support $A$ is $s'(1 + w)N$. Therefore, the number of real
transactions that support $A$ in $T'$ is given as follows:

$$s'(1+w)N - \frac{wN}{C_n^k (2l-1)} \sum_{Y=k}^{2l-1} C_Y^k \qquad (6)$$

If we consider the real support to be $s$, then it is possible to write the above formula
as $s = s'(1 + w) - \frac{w}{C_n^k (2l-1)} \sum_{Y=k}^{2l-1} C_Y^k$. Therefore, we can write the new
minimum support as follows:

$$s'_{min} = \frac{s_{min}}{1+w} + \frac{w}{(1+w)\, C_n^k (2l-1)} \sum_{Y=k}^{2l-1} C_Y^k \qquad (7)$$

Since all of the parameters in (7) are known, it is then easy to learn the association
rules in the anonymized transactions $T'$ given the minimum support $s_{min}$ in the
unanonymized set of transactions $T$. For further details on the FS scheme and its
optimization, please refer to [14].
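
To make the support adjustment concrete, the following sketch evaluates the adjusted minimum support for a $k$-itemset, assuming that fake-transaction lengths are uniform on $[1, 2l-1]$ and that a fake transaction of length $Y$ supports a given $k$-itemset with probability $C_Y^k / C_n^k$. The function names are illustrative only and are not taken from [14].

```python
from math import comb


def fake_support_fraction(n, k, l):
    """Average probability that one fake transaction supports a given
    k-itemset, with the fake length Y uniform on [1, 2l - 1]."""
    total = sum(comb(y, k) for y in range(k, 2 * l))  # Y = k .. 2l - 1
    return total / (comb(n, k) * (2 * l - 1))


def new_min_support(s_min, w, n, k, l):
    """Minimum support to use on the anonymized database T'."""
    return (s_min + w * fake_support_fraction(n, k, l)) / (1 + w)


def real_support(s_prime, w, n, k, l):
    """Recover the support in T from the support s' observed in T'."""
    return s_prime * (1 + w) - w * fake_support_fraction(n, k, l)


# Toy parameters: n = 5 items, 1-itemsets, l = 2, w = 2, s_min = 0.3.
s_new = new_min_support(0.3, w=2, n=5, k=1, l=2)
print(round(s_new, 4))                                 # adjusted threshold
print(round(real_support(s_new, 2, 5, 1, 2), 4))       # maps back to 0.3
```

With these toy values the fake-support fraction is $6/15 = 0.4$, so the adjusted threshold is $(0.3 + 2 \times 0.4)/3 \approx 0.3667$; applying the inverse mapping returns the original 0.3.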



4 Privacy preserving association rule mining revisited 

In this section, we revisit the aforementioned FS scheme and introduce three main
results, which are as follows: (i) we show that the FS scheme is resource exhaustive
(especially in terms of the large memory it requires in order to provide a reasonable
level of privacy); (ii) we show that the theoretical quantification of the privacy in FS
follows the worst-case study, while the average case can be a better descriptor of the privacy
quantification, and we derive a general formula for the average-case quantification; and (iii)
we show that, using a two-round attack where the first round applies common
filters on the data and the second applies random selection, the privacy can
be less than in the above two cases.



4.1 Requirements analysis of the FS scheme 

The privacy of the FS scheme depends solely on the parameters $l$ and $w$. While the
first parameter does not have any effect on the required memory, the second parameter,
which is the determinant factor of the privacy (according to (3)), has a great effect. The
privacy attained by the FS scheme is defined as $P_p^{FS} = 1 - P_r^{FS} = 1 - \frac{1}{1+w}$. In order to
attain a relatively high privacy, $w$ needs to be high. For example, to achieve a privacy of
90% (i.e., 0.9 on the unit scale), $w$ needs to be at least 9. That is, the required additional
memory (as one measure of resources) for representing and storing the fake transactions
in $T'$ will be 9 times the original database size. To illustrate the growth of this
function (see footnote 3), Fig. 1 shows different growth regions. In Fig. 1(a), the growth is shown for
$0 < w < 1$, which reflects the fast growth region attaining 0.5 privacy (i.e., 50%). Fig.
1(b) shows the range $0 < w < 10$, from which we obtain that an increment of 9
in $w$ leads to only about 0.4 additional privacy preservation over the case of $w = 1$ (i.e.,
an overall preservation of about 0.9). Finally, for $10 < w < 100$, Fig. 1(c) shows that a change
of $w$ by 90 would add a privacy preservation of only about 0.09, accumulating to 0.99 overall at
$w = 100$.
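
The diminishing return of $w$ can be tabulated directly. The sketch below assumes the privacy form $P_p^{FS} = 1 - \frac{1}{1+w}$ used in this section; the helper names `fs_privacy` and `required_w` are illustrative.

```python
def fs_privacy(w):
    """Worst-case FS privacy for a fake-to-real ratio w."""
    return 1.0 - 1.0 / (1.0 + w)


def required_w(target_privacy):
    """Smallest ratio w that reaches the target privacy (0 <= target < 1)."""
    return target_privacy / (1.0 - target_privacy)


# Privacy at the boundaries of the three regions discussed above.
for w in (1, 10, 100):
    print(f"w = {w:>3}: privacy = {fs_privacy(w):.3f}")

# Storage overhead (in multiples of the original database) per target.
for target in (0.5, 0.9, 0.99):
    print(f"target {target}: w = {required_w(target):.1f}")
```

Since the storage overhead is proportional to $w$, each additional "nine" of privacy costs roughly ten times more memory under this model.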

3 Though the function growth may not express the real memory requirements, the shape
of the growth function is a clear indicator that the privacy grows more slowly as w grows larger.



(a) 0 < w < 1 (b) 0 < w < 10 (c) 20 < w < 100

Fig. 1. The attained privacy versus the required w, which reflects the required overhead in
terms of memory and computation.



As mentioned earlier, the parameter $w$ directly affects the required resources in
terms of memory and computation. While the memory part is illustrated above, the
required computation depends linearly on the size of the dataset from which the association
rules are to be learned. That is, the increased database size in $T'$ will require $w$ times
more computational power than association rule discovery in $T$ only.



4.2 Average-case for privacy quantification 

The privacy attained in the FS scheme according to the description in [14] is referred
to as the worst-case privacy. The worst-case privacy is derived by assuming that the
reconstruction probability of any tuple in the anonymized database $T'$ is equal to the
reconstruction probability of the first (thus the worst) tuple. In other words, the reconstruction
probability of all tuples is assumed to be equal. However, since the attacker is assumed to reconstruct
tuples successively without replacement, there is a need to define an average-case
privacy. In the following (Theorem 1), we define the average-case privacy and show
its relation to the worst-case privacy in [14].

Theorem 1 (average-case privacy). The quantification of privacy in [14] considers
the best reconstruction probability of a single record (i.e., the worst-case privacy measure),
while the real privacy preserved (on average) is greater than the worst-case
quantification.

Proof. Consider an adversary $\mathcal{A}$ interested in obtaining the whole set of real
transactions by applying a random selection process. For the sequence of trials to obtain the
transactions $t_1, \ldots, t_N \in T$, the following is the probability of successful
reconstruction of the $N$ real transactions anonymized among the $w \times N$ fake transactions.



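$$\bar{P}_r = \frac{1}{N} \left( \frac{N}{wN + N} + \frac{N-1}{wN + N - 1} + \cdots + \frac{N-(N-1)}{wN + N - (N-1)} \right) = \frac{1}{N} \left( p_0 + p_1 + \cdots + p_{N-1} \right) \qquad (8)$$

where $p_i = \frac{N-i}{wN+N-i}$. Then, it is easy to verify that $p_i > p_{i+1}$ for $0 \le i \le N - 2$.
Take for example $i = 0$; then $\frac{N}{wN+N} > \frac{N-1}{wN+N-1}$. By multiplying both sides by
$(wN+N)(wN+N-1)$, we get $N(wN+N-1) > (N-1)(wN+N)$, which simplifies to $wN > 0$ and
is valid for any $w > 0$ and $N \ge 2$ (note that these conditions are always
valid under the real data assumptions). We can similarly extend the above result to any
$i > 0$ and say that $c \times p_i > \sum_{j=0}^{c-1} p_{i+j}$ for any $i \ge 0$ and $c > 1$. That is (as a
special case, by substituting $i = 0$ and $c = N$), $N \times p_0 > \sum_{i=0}^{N-1} p_i$, which means
$p_0 > \frac{1}{N} \sum_{i=0}^{N-1} p_i$. However, $\frac{1}{N} \sum_{i=0}^{N-1} p_i = \bar{P}_r$ and $p_0 = P_r^{FS}$. Then, $P_r^{FS} > \bar{P}_r$. From
the final result, we get that

$$P_r^{FS} > \bar{P}_r$$
$$1 - P_r^{FS} < 1 - \bar{P}_r$$
$$P_p^{FS} < \bar{P}_p^{FS} \qquad (9)$$

where $P_p^{FS}$ and $\bar{P}_p^{FS}$ are the quantification of privacy preserved in the FS scheme in
[14] and in the average case introduced by us, respectively. □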
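
Theorem 1 can be checked numerically. The sketch below assumes the sequential-selection model of the proof, in which the $i$-th real transaction is picked with probability $\frac{N-i}{wN+N-i}$ for $i = 0, \ldots, N-1$; the function names are illustrative.

```python
def worst_case_reconstruction(w):
    """Reconstruction probability of the first draw, 1 / (1 + w)."""
    return 1.0 / (1.0 + w)


def average_case_reconstruction(w, n_real):
    """Mean of the per-draw success probabilities p_0 .. p_{N-1}."""
    total = sum((n_real - i) / (w * n_real + n_real - i) for i in range(n_real))
    return total / n_real


for w in (1, 2, 4, 9):
    worst = 1.0 - worst_case_reconstruction(w)
    avg = 1.0 - average_case_reconstruction(w, n_real=1000)
    print(f"w = {w}: worst-case privacy = {worst:.3f}, average-case = {avg:.3f}")
```

For every $w > 0$ and $N \ge 2$ the average-case privacy printed here exceeds the worst-case value, in line with inequality (9); for $N = 1$ the two coincide.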




Fig. 2. The average versus the worst case privacy preservation 



Note that the last result of the average-case privacy quantification is more general
and better expresses the real situation of the privacy attained according to the definition
in [14]. In particular, this privacy is more suitable for modeling the attack below.



4.3 On fake transactions filtering 



The main concern in [14] has been the filtering (and therefore the reconstruction) of the
real transactions inserted in between the fake transactions. However, an adversary
$\mathcal{A}$ might be interested in removing some of the fake transactions which are obvious in
order to maximize the chances of obtaining the real transactions in the remaining set of
transactions according to the aforementioned privacy quantification model.

The above is possible because, practically, it is not possible to generate fake
transactions that closely resemble the distribution of the original data. This is especially
obvious when the distribution of the dataset is unknown or biased. This shortcoming
opens a great chance for filtering the weak fake transactions using many off-the-shelf
statistical tools. Moreover, given additional information on the distribution of the user
choice in the data, it is further possible to filter a large number of fake transactions.
Generally speaking, however, the filtering may follow one, or even both, of the following
strategies:

- Random filtering: since the number of fake transactions in $T'$ is greater than
the number of real transactions, especially when $w > 1$, a transaction selected
at random is more likely to belong to the set of fake transactions.

- Guided filtering: given enough information about the distribution of the real
and fake transactions and the choice of users (in general), $\mathcal{A}$ can easily (with high
certainty) filter a large number of the fake transactions.

In order to study the impact of this filtering on the quantified privacy preservation,
let the efficiency of the filter applied on $T'$ be $\gamma$ where $0 \le \gamma \le 1$. Then, it is easy to
extend the result in (8) to the following:

$$\bar{P}_r^{\gamma} = \frac{1}{N} \left( \frac{N}{(1-\gamma)wN + N} + \frac{N-1}{(1-\gamma)wN + N - 1} + \cdots + \frac{N-(N-1)}{(1-\gamma)wN + N - (N-1)} \right) \qquad (10)$$

Similarly, we can define the new average-case privacy (given $\gamma$) as

$$\bar{P}_p^{\gamma} = 1 - \frac{1}{N} \sum_{i=0}^{N-1} \frac{N-i}{(1-\gamma)wN + N - i} \qquad (11)$$

To illustrate the impact of the filtering on the privacy preservation, Table 1 shows
the quantified privacy preservation for different filtering efficiency parameters $\gamma$ and
different values of $w$.
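
The effect of filtering can be explored with a small script. This sketch assumes the filtered average-case form in which a filter of efficiency $\gamma$ leaves $(1-\gamma)wN$ fake transactions in the released data; the function name is illustrative, and the printed values come from this model only and are not a reproduction of Table 1.

```python
def avg_privacy_filtered(w, n_real, gamma):
    """Average-case privacy after removing a fraction gamma of the fakes."""
    remaining_fake = (1.0 - gamma) * w * n_real
    recon = sum(
        (n_real - i) / (remaining_fake + n_real - i) for i in range(n_real)
    ) / n_real
    return 1.0 - recon


for w in (2, 4, 9):
    row = [avg_privacy_filtered(w, 1000, g) for g in (0.0, 0.25, 0.5, 0.75, 1.0)]
    print(f"w = {w}: " + "  ".join(f"{p:.3f}" for p in row))
```

Two sanity checks follow from the formula: with $\gamma = 0$ the value equals the unfiltered average-case privacy, and with $\gamma = 1$ (all fake transactions removed) the privacy drops to zero.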



5 Remarks and Extensions 

Obviously, the FS scheme introduces some great properties and yet suffers from some
drawbacks, which are summarized as follows:

- Unlike other schemes (such as the PS scheme), the FS scheme offers theoretically
high privacy given enough resources (i.e., computation and memory).
However, the need for such resources is a drawback when high privacy is required.

- The presence of the (bare) real transactions in between the fake transactions
gives a great chance for real/fake transaction filtering, leading to a reduction of the
privacy.

Table 1. Quantified privacy preservation under several filtering efficiency factors.

Based on that, there is a great chance to utilize an extended version of the FS scheme
that maintains its advantages and reduces (or overcomes) its disadvantages. Here, we
recall another PP-ARM scheme from the literature (PS) and explain how a hybrid
scheme of both PS and FS (referred to as HS) will achieve the aforementioned goals.



5.1 MASK for privacy preserving association rule mining 

The distortion of the data using the MASK scheme (i.e., the PS scheme) is very simple
when applied to a database defined according to the above market basket model
(i.e., (1)). To preserve the privacy, the data owner performs the following:

- Each tuple in the database is considered as a random vector $X = \{X_i\}$ where
$X_i = 0$ or $1$.

- The distortion follows the procedure $Y = distort(X)$, where $Y_i = X_i \oplus \bar{r}_i$
and $\bar{r}_i$ is the complement of $r_i$, which is a realization of a random variable with
the probability distribution function $f(r) = bernoulli(p)$ for $0 \le p \le 1$.

The implication of such a random variable is that $r_i$ takes the value '1' with probability $p$
and '0' with probability $1 - p$. For the case of $r_i = 1$, the original bit $X_i$ in the data
tuple is kept the same (with probability $p$), and for the case of $r_i = 0$, the original bit $X_i$ is
altered to its complement. The privacy of the PS scheme is estimated
by the probability according to which the reconstruction of zeros and ones is possible
(see footnote 4), where $s_0$ denotes the fraction of '1's in the original data:

1. Reconstruction of ones according to
$R_1 = P_r\{Y_i = 1 | X_i = 1\} P_r\{X_i = 1 | Y_i = 1\} + P_r\{Y_i = 0 | X_i = 1\} P_r\{X_i = 1 | Y_i = 0\}
= \frac{s_0 \times p^2}{s_0 \times p + (1 - s_0) \times (1 - p)} + \frac{s_0 \times (1 - p)^2}{s_0 \times (1 - p) + (1 - s_0) \times p}$.

2. Reconstruction of zeros according to
$R_0 = P_r\{Y_i = 1 | X_i = 0\} P_r\{X_i = 0 | Y_i = 1\} + P_r\{Y_i = 0 | X_i = 0\} P_r\{X_i = 0 | Y_i = 0\}
= \frac{(1 - s_0) \times (1 - p)^2}{s_0 \times p + (1 - s_0) \times (1 - p)} + \frac{(1 - s_0) \times p^2}{s_0 \times (1 - p) + (1 - s_0) \times p}$.



4 Note that this definition of privacy is better than the previous one since it implies an
average-case reconstruction per bit.



The overall probability of reconstruction is given as follows:

$$P_r^{PS} = aR_1 + (1 - a)R_0 \qquad (12)$$

where $a$ is a privacy parameter (for more details on the derivation, refer to [11]).
The amount of privacy preserved is given as follows:

$$P_p^{PS} = 1 - P_r^{PS} = 1 - (aR_1 + (1 - a)R_0) \qquad (13)$$

The miner simply computes the support against $s_{min}$ for all candidates in the
randomized tuples that map to the same original tuple, requiring only a linear number of
counters. That is, the computation overhead depends linearly on the size of the dataset
and the length of each itemset in the worst case (see footnote 5).
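
The PS distortion and its privacy quantification can be sketched as follows. This is a minimal illustration, not the MASK implementation of [11]: the function names are hypothetical, `s0` stands for the fraction of '1's in the original data (assumed strictly between 0 and 1), and the two reconstruction probabilities are written out by applying Bayes' rule to the bit-flipping model described above.

```python
import random


def distort(bits, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]


def reconstruction_of_ones(p, s0):
    """R_1: probability of correctly reconstructing a '1'."""
    return (s0 * p ** 2) / (s0 * p + (1 - s0) * (1 - p)) + (
        s0 * (1 - p) ** 2
    ) / (s0 * (1 - p) + (1 - s0) * p)


def reconstruction_of_zeros(p, s0):
    """R_0: probability of correctly reconstructing a '0'."""
    return ((1 - s0) * (1 - p) ** 2) / (s0 * p + (1 - s0) * (1 - p)) + (
        (1 - s0) * p ** 2
    ) / (s0 * (1 - p) + (1 - s0) * p)


def ps_privacy(p, s0, a):
    """Preserved privacy 1 - (a * R_1 + (1 - a) * R_0)."""
    r1 = reconstruction_of_ones(p, s0)
    r0 = reconstruction_of_zeros(p, s0)
    return 1.0 - (a * r1 + (1 - a) * r0)


rng = random.Random(0)
print(distort([1, 0, 0, 1, 0, 0, 0, 0], p=0.9, rng=rng))
print(round(ps_privacy(p=0.9, s0=0.01, a=0.9), 3))
```

Under this model, $p = 1$ leaves the data untouched (so both reconstruction probabilities are 1 and the privacy is 0), while $p = 0.5$ gives $R_1 = s_0$ and $R_0 = 1 - s_0$.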

5.2 Comparison 

In this section, we compare the two aforementioned schemes and point out their strengths
and shortcomings. Obviously, the PS scheme requires no memory overhead (apart from
that required for representing the data itself), while the FS scheme requires memory
space for the additional $wN$ fake transactions used to hide the real
transactions. Such memory can be tens of gigabytes for a typical database, limiting the latter
scheme's feasibility and applicability.

The PS scheme has an upper bound on the quantified privacy. That is, for the
maximum possible $p$, the attained privacy is equal to 89%. While this is possibly sufficient
for some applications, for many privacy-critical applications this would be a great
enough breach [13]. On the other hand, the privacy in the FS scheme depends merely
on the allowed amount of overhead.

In both schemes, excessive privacy results in a relatively higher error of the mining
algorithm. Also, while the PS scheme requires modification of the mining algorithm to
maintain a reasonable computation overhead, the FS scheme can use any off-the-shelf
algorithm for mining. Table 2 shows a concluding comparison between the two schemes
above.

Table 2. Comparison between the FS and PS schemes

Feature            PS scheme    FS scheme
Memory Overhead    None         O(wN)
Computation        ~ N          ~ wN
Mining Algorithm   Modified     off-the-shelf


5 This is also considered an additional merit of the PS over the FS. A further optimization
technique is shown in [24] as well.



6 Hybrid scheme for association rules 



Our scheme utilizes the two schemes introduced above to combine their advantages
and reduce their disadvantages, especially those related to the memory overhead
and limited privacy.



6.1 HS for PP-ARM 

Our hybrid scheme (HS for brevity) works as follows: first, fake transactions are
produced in the same way as in the FS scheme and inserted in between the real
transactions for the whole set of transactions in the database; then, the modified database
is distorted using the procedure of the PS scheme. The scheme is quantified as follows
(analysis is omitted):

$$P_r^{HS} \stackrel{def}{=} P_r^{FS} P_r^{PS} = \frac{P_r^{PS}}{1 + w} \qquad (14)$$

$$P_p^{HS} \stackrel{def}{=} 1 - P_r^{HS} = 1 - \frac{P_r^{PS}}{1 + w} \qquad (15)$$

Since both $P_r^{FS}$ and $P_r^{PS}$ are less than one, the resulting privacy $P_p^{HS}$ is always
greater than the privacy of either of the two schemes alone.



6.2 Measures and Metrics 

To study the characteristics of the HS scheme, we use the following three criteria: (1)
privacy measure (Lemma 1), (2) error measure, and (3) overhead measures in terms of
computation and memory (Lemma 2).

Lemma 1. The quantified privacy preserved using our hybrid scheme HS is higher
than the preservation using either the PS or the FS alone.

Proof (sketch). Given that $0 < P_r^{FS} < 1$ and $0 < P_r^{PS} < 1$, it is trivial to see
that $P_r^{FS} P_r^{PS} < P_r^{FS}$ and $P_r^{FS} P_r^{PS} < P_r^{PS}$. That is, $1 - P_r^{FS} P_r^{PS} > 1 - P_r^{FS}$ and
$1 - P_r^{FS} P_r^{PS} > 1 - P_r^{PS}$, which gives $P_p^{HS} > P_p^{FS}$ and $P_p^{HS} > P_p^{PS}$, respectively. □

As a special case, it can be easily shown that the privacy attained by our scheme is higher
than that of the PS scheme even when $P_p^{PS}$ equals its maximum value (i.e., minimum $P_r^{PS}$).

Lemma 2. For the same privacy level, our HS scheme requires less storage than the FS scheme.

Proof. Let $w_1$ and $w_2$ be two parameters defined for the FS and HS schemes, respectively.
The privacy attained by each scheme is given as $P_p^{FS} = 1 - \frac{1}{1 + w_1}$ and $P_p^{HS} = 1 - \frac{P_r^{PS}}{1 + w_2}$.
By setting $P_p^{FS} = P_p^{HS}$ (i.e., the attained privacy is equal in both schemes), we get that:

$$P_r^{PS} = \frac{1 + w_2}{1 + w_1}$$

However, since $P_r^{PS}$ is less than 1 (more specifically, the maximum $P_p^{PS}$ is equal to 0.89),
the above equality is only possible when $w_2 < w_1$. □



Table 3. Error of mining in terms of false positive σ+ and false negative σ− for HS
versus FS considering different parameters w and for p = 0.5 and different minimum
support values.

                            s_min = 0.005    s_min = 0.0025   s_min = 0.001
scheme      w   privacy     σ+      σ−       σ+      σ−       σ+      σ−
HS scheme   2   0.833       4.013   2.728    2.341   2.340    2.172   1.503
FS scheme   2   0.667       2.985   1.493    1.607   1.607    1.102   0.701
HS scheme   4   0.900       6.731   4.275    4.762   3.698    1.591   1.620
FS scheme   4   0.800       4.975   2.985    3.214   2.501    1.027   1.152



Example 5. For example, to attain a privacy $P_p^{HS} = P_p^{FS} = 0.95$ when $P_r^{PS} = 0.3$, it is
enough to set $w_2 = 5$, while $w_1$ must be at least 19.
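
The storage saving in Lemma 2 and Example 5 can be reproduced with a few lines. The sketch assumes the privacy forms $P_p^{FS} = 1 - \frac{1}{1+w_1}$ and $P_p^{HS} = 1 - \frac{P_r^{PS}}{1+w_2}$; the function names are illustrative.

```python
def w_for_fs(target_privacy):
    """Fake-to-real ratio w_1 that FS needs to reach the target privacy."""
    return 1.0 / (1.0 - target_privacy) - 1.0


def w_for_hs(target_privacy, p_r_ps):
    """Ratio w_2 that HS needs, given the PS reconstruction probability."""
    return p_r_ps / (1.0 - target_privacy) - 1.0


target = 0.95
print(f"FS needs w_1 = {w_for_fs(target):.2f}")
print(f"HS needs w_2 = {w_for_hs(target, p_r_ps=0.3):.2f}")
```

For the values of Example 5, the script yields $w_1 \approx 19$ and $w_2 \approx 5$, i.e., the hybrid scheme stores roughly a quarter of the fake transactions for the same quantified privacy.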

For the error measurement, represented by false positives and false negatives,
we perform the experiment on the dataset BMS-WebView-1 [22]. The dataset
consists of 59602 transactions over 497 items, and the average transaction
length (i.e., $l$) is equal to 2 [22]. We further set $w$ to two values,
2 and 4, generating fake transactions according to the procedure in Section 3, and set $p = 0.5$,
according to which the privacy of the PS scheme is determined. The measurements of the
error are shown in Table 3.

7 Conclusion and Future Works 

Privacy preserving association rule mining (PP-ARM) is a critical research issue
for which several schemes have been proposed for computing the support of itemsets in a randomized
dataset considering different randomization techniques. In this paper, we revisited
PP-ARM using fake transactions and showed three major results. We first redefined the
privacy to include the average-case consideration. We then pointed out the exhaustive
requirements of FS in terms of memory and computation. We further pointed out
a drawback of FS in practice by showing its weakness against fake transaction
filtering. In order to avoid such limitations of FS, we extended it to a hybrid scheme
with the PS scheme and showed, through both analytical and experimental results, the attained
properties.

In the near future, it will be interesting to investigate the derivation of concrete
error measures (in terms of false negatives and false positives). Also, we will consider
experimentation over datasets with different parameters (i.e., $l$, $n$, and $N$).